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Two canonical problems in geostatistics are estimating the pa- 
rameters in a specified family of stochastic process models and pre- 
dicting the process at new locations. A number of asymptotic results 
for these problems over a fixed spatial domain indicate that, for a 
Gaussian process with Matern covariance function, one can fix the 
range parameter controlling the rate of decay of the process and ob- 
tain results that are asymptotically equivalent to the case that the 
range parameter is known. We discuss why these results do not al- 
ways provide the appropriate intuition for finite samples. Moreover, 
we prove that a number of these asymptotic results may be extended 
to the case that the variance and range parameters are jointly esti- 
mated via maximum likelihood or maximum tapered likelihood. Our 
simulation results show that performance on a variety of metrics is 
improved and asymptotic approximations are applicable for smaller 
sample sizes when the range parameter is estimated. These effects are 
particularly apparent when the process is mean square differentiable 
or the effective range of spatial correlation is small. 



1. Introduction. The analysis of point-referenced spatial data, often 
referred to as geostatistics, relies almost exclusively on a single construct: 
the second-order stationary Gaussian process with a parametric mean and 
covariance function. Exceptions may be found, some of them notable, but in 
almost all elaborate hierarchical or nonstationary models in the literature, 
one can find structures built from stationary Gaussian processes. 

Given the prominent role of the stationary Gaussian process, it is per- 
haps surprising that the theoretical properties of inference under this model 
remain incompletely understood. Consider a canonical problem in geostatis- 
tics, that of predicting the value of a spatial process at locations not con- 
tained in the dataset. The steps of the analysis using a Gaussian process 
can be characterized as follows: choose a parametric mean and covariance 
function, estimate the model parameters from the data, then plug the es- 
timated parameters back into the model to make predictions and compute 
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standard errors at unobserved locations. Discussion of these related steps can 
be found in most spatial statistics textbooks, but see for example Cressie 
(1993, Chapters 2 and 3), or see Banerjee, Carlin and Gelfand (2004) for a 
Bayesian treatment. Stein (2010) gives a succinct overview of asymptotic 
issues for both estimation and prediction. 

Covariance model choice, estimation, and prediction have each received 
considerable theoretical attention, although often in isolation from one an- 
other. A thorough treatment of the problem of model choice may be found 
in Stein (1999). In this work and elsewhere, Stein makes a compelling case 
for using the Matern covariance model for the Gaussian process {Z(s),s G 
D C sft d }, with 

(1) 

Cov(Z( Sl ),Z( Sj )) = a 2 K(s t - Sj ;p,u) = ^^MML ^Si - Sj \\/p) : 

for a 2 ,p,u > 0, where K, u is the modified Bessel function of the second 
kind of order v (Abromowitz and Stegun, 1967, Section 9.6). The range pa- 
rameter p controls the rate of decay with distance, with larger values of p 
corresponding to more highly correlated observations. This model is partic- 
ularly attractive because of its flexibility in representing the smoothness of 
the Gaussian process, with any number of mean square derivatives being 
possible, according to the value of v (Stein, 1999). 

Zhang (2004) provides influential results concerning the consistency (and 
inconsistency) of parameter estimates for the Matern model under infill, or 
fixed-domain asymptotics. Infill asymptotics requires that the sampling do- 
main be fixed as the number, and hence density, of observations increases to 
infinity. These results follow from a more fundamental result in Zhang (2004) 
concerning equivalence (mutual absolute continuity) of Gaussian measures 
on bounded domains. 

A theoretical treatment of the third component of geostatistical analyes, 
prediction and corresponding standard error estimation, has been developed 
in a series of works by Stein (1988, 1990, 1993, 1999). These works provide 
conditions under which predictions using a mis-specified covariance function 
are asymptotically efficient and associated standard error estimates converge 
almost surely to their true values under infill asympotics. One such condition 
is that the mis-specified covariance be chosen so that the resulting Gaussian 
measure and the true one are equivalent, providing a link to the results in 
Zhang (2004). However, as we will discuss in Section 2.2, the nature of that 
link has sometimes been misinterpreted. 

Like most of the aforementioned works, we focus on the isotropic d- 
dimensional Matern covariance model. Our goal is to provide a holistic ac- 
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count of the asymptotic properties of the canonical geostatistical analysis 
under the Matern model, from parameter estimation and inference to pre- 
diction. This account includes the re-statement of key results, as well as the 
introduction of new results to fill important gaps. 

We devote particular attention to the range parameter p. One may de- 
tect in the recent literature a vein of reasoning that p is unimportant in 
practice. For example, Zhang and Wang (2010) make this argument in the 
context of prediction, and Gneiting, Kleiber and Schlather (2010) make a 
similar argument for using a single range parameter in a Matern model for 
multivariate random fields. These authors borrow intuition from the asymp- 
totic results of Stein (1988), Zhang (2004), and others, results that fix p at 
an arbitrary value, often for mathematical tractability. Each of these results 
presents some particular variation of the conclusion that fixing p at an incor- 
rect value is asymptotically just as good as using the true value. However, 
as we will show via a simulation study in Section 3, this intuition cannot 
necessarily be transferred so readily to the finite sample case. Several of our 
new results indicate that the same asymptotic behaviors also hold for the 
case of estimated p, and we demonstrate in the simulation that the approxi- 
mations provided by these results can be applicable for much smaller sample 
sizes than when p is fixed. 

In the Section 2 we state and extend a number of important results for the 
Matern model, moving from the case that p is fixed at an arbitrary value 
to the case that a 2 and p are jointly estimated via maximum likelihood. 
We also review and extend asymptotic results for prediction, although this 
theory does not yet allow for estimated p. In Section 3 we carry out a sim- 
ulation study that allows us to determine the cases in which estimation of 
the range parameter is important to performance and the cases in which it 
is not important. We show that instances in the literature where the range 
parameter was claimed not to matter correspond to particular scenarios and 
that this result does not hold more generally. In Section 4 we prove that 
similar arguments as in Section 2 for the maximum likelihood estimator can 
also be applied to an estimator maximizing an approximation using covari- 
ance tapering. We conclude in Section 5 with a discussion of the implications 
of these results and a number of ways in which assumptions we have made 
here might be relaxed. 

2. Asymptotic Theory for Estimation and Prediction. We begin 
with some notation and assumptions that will be used in all our results unless 
specifically stated otherwise. Let Z = {Z(s),s G D C $l d } be a stochastic 
process on a bounded domain D, with d = 1, 2, or 3. Let 67(0, a 2 Kg) denote 
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the mean zero stationary Gaussian measure for Z with marginal variance 
a 2 > and correlation function Kq, depending on parameters 9E0C K p . 
For a particular sampling design S n = {s\, . . . , s n S D} of distinct locations, 
we observe Z n = (Z(s\), . . . Z(s n )) T . Our tasks are to use Z n to estimate 
a 2 and 9 and to predict Z(sq) for some location sq € D, not in S n . Our 
results concern the behavior of these estimators and predictors as we take 
an increasing number of observations in a bounded domain D, called "infill" 
or "fixed-domain" asymptotics. 

Throughout, we focus our discussion on the Matern covariance function 
(1), and we use G(0,a 2 K p ^) to denote a mean zero Gaussian measure with 
this covariance function. We also assume that the smoothness parameter v 
is known. Although having the flexibility to estimate v is desirable, results 
in this more general case have thus far been difficult to obtain under the 
fixed-domain asymptotics. Indeed, of the handful of results that have been 
obtained for estimators under this framework, we know of none that do make 
not the assumption that v is known. Our focus is on the role played by the 
range parameter p in this model, namely to show that several important 
results that have been provable only in the case of fixing p at an arbitrary 
value can be extended to the case that p is estimated, and that it is often 
advantageous in practice to do so. 

The reason that it is justifiable to fix p, at least in an asymptotic sense, 
follows from a property of the Matern model shown by Zhang (2004). This 
result indicates that when the dimension d < 3, two Gaussian measures with 
the same v but different values of p can in fact be equivalent, so that it is 
impossible to distinguish between them even when observing Z{s) for all 
se D. 

Theorem 1 (Theorem 2 of Zhang (2004)). For fixed u> 0, G(0,a%K po>u ) 
and G(0, o\K px u ) are equivalent on bounded domains if and only if&o/pQ = 

Anderes (2010) showed that this result does not hold for d > 5, and the 
case d = 4 is still open. However, the assumption that d < 3 seems adequate 
for most spatial applications. 

The parameter c = a 2 / p 2u is what Stein (1999) calls a microergodic pa- 
rameter. Stein (1999, page 175) gives a "reasonable conjecture" for microer- 
godic parameters that we will show to be true in this case. He suggests 
reparameterizing into microergodic and non-microergodic components of the 
parameter vector, which we could choose to define as c and p, respectively. 
The conjecture is that if all model parameters are estimated by maximum 
likelihood, the asymptotic behavior of the MLE for the microergodic param- 
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eter, here c, is the same as if the non-ergodic component, here p, were known. 
In the next section, we outline existing results that concern the asymptotic 
behavior for the MLE for c when p is fixed at an arbitrary value, and we 
extend them to the case that p is estimated. This distinction is important 
in practice in a number of cases, as we highlight in the simulation study of 
Section 3. Section 2.2 turns to asymptotic theory for the prediction problem. 
The role of the microergodic parameter is critical here as well. 

2.1. Estimation of Covariance Parameters. Theorem 1 has an immedi- 
ate and important corollary for estimation. 

Corollary 1 (Corollary 1 of Zhang (2004)). Let S n be an increasing 
sequence of subsets of D. There do not exist consistent estimators of a 2 or 
p based on the corresponding sequence of observation vectors Z n . 

It is important to note what Corollary 1 does and does not imply. It 
states that as n — > oo, the sampling distributions of estimators of a 2 and p, 
including, for example, the maximum likelihood estimators, do not concen- 
trate their mass about the true parameter values. However, this does not 
mean that the data contain no information about a 2 and p individually. 
Indeed, in simulation studies we observe that sampling distributions for the 
maximum likelihood estimators can in many cases be quite concentrated 
about the true values, even as we know these distributions will not become 
ever more concentrated as n increases (Zhang, 2004; Kaufman, 2006). Some 
intuition behind this can be given by appealing to another asymptotic frame- 
work, that of increasing the domain of observations. Mardia and Marshall 
(1984) give regularity conditions under which the maximum likelihood es- 
timators for all model parameters are consistent and asymptotically nor- 
mally distributed under the increasing domain framework. Any finite set 
of observation locations could conceivably be a member in a sequence un- 
der either the fixed-domain or increasing-domain asymptotic framework. 
Zhang and Zimmerman (2005) suggest choosing between the asymptotic 
frameworks based on the degree to which asymptotic approximations hold 
for finite samples. They also note that the increasing domain framework 
can be mimicked by fixing the domain but decreasing the range parameter. 
Therefore, it is not surprising that when the true range parameter is small 
relative to the sampling domain, it can be well estimated from data. 

The likelihood function for a 2 and p under the Matern model with fixed 
v > based on observations Z n is 



(2) 
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where T n (p) is the matrix with entries K(s{ — Sj 4 , p, v) (i,j = 1, . . . , n) for 
K defined as in (1). We consider two types of estimators obtained by maxi- 
mizing (2). The first fixes p n = p\ for all n and maximizes C n (a 2 ,pi). The 
second maximizes (2) over both a 2 and p. In either case, the estimator of a 2 
may be written as a function of the corresponding estimator of p. That is, 
we may write al(p n ) = argmax^ £ n (a 2 ,p n ) = Z T T n (p n )~ l Z n /n, where p n 
is either p\ or the value the maximizes the profile likelihood for p. In most 
cases the latter estimator is not available in closed form and must be found 
numerically. 

We may likewise express the corresponding estimators of c = a 2 / ' p 2u as a 
function of p n , namely 



The following result, taken from Theorem 3 of Zhang (2004) and Theorem 
3 of Wang and Loh (2011), defines the asymptotic behavior of c n (pi) for an 
arbitrary fixed value p\ > 0. 

Theorem 2. Let S n be an increasing sequence of subsets of D. Then as 
n — > co, 

1- Cn(pi) — > o~o/Po u almost surely, and 

2. Vn{c n ( Pl ) - ol/pf) -> 7V(0,2(crg/p^) 2 ) in distribution 

under G(0, a 2 K P0:l/ ). 

Special cases of Theorem 2 were proven previously by Ying (1991), who 
treated the case that v = 1/2 with D an interval in the Ornstein- 
Uhlenbeck process, and by Du, Zhang and Mandrekar (2009), who treated 
the more general case with d = 1. 

A key contribution of the current paper is to show that Theorem 2 can be 
used as a stepping stone to proving that the maximum likelihood estimator 
c n (p n ) has exactly the same asymptotic behavior as does c n (po). We make 
use of the following lemma, which shows that c n (p n ) is monotone when 
viewed as a function of p ri . 

Lemma 1. Let S n = {si, S2, ■ ■ ■ , s n £ D C K d } denote any set of obser- 
vation locations in any dimension. Fix u > and define T n (p) to be the ma- 
trix with entries K(si — Sj] p,v) as in (1). Define c n (p) = Z^T^p)" 1 Z n j '{np 2v ). 
Then for any < p\ < P2, c n {p2) < c n {pi) for any vector Z n . 

Proof. Let < p\ < p2- The difference 



(3) 



C n {Pn) = 0-l(p n )/pl U = Z^T n (p n ) 1 Z n j ' (np 2 ») 



C n (pl) ~ C n {p 2 ) = Z n [ Pl V T n ( Pl ) 



-1 



p^T{p 2 )- 1 ]Z n /n 
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is non-negative for any Z n if the matrix A = / oJ~ 2 ' / r n (/Oi)~ 1 — P2 2u ^ n (p2)~ 1 
is positive semi-definite. By Corollary 7.7.4(a) of Horn and Johnson (1985, 
page 473), A is positive semi-definite if and only if the matrix B = p^Y \{p2)~ 
p\ u T n (pi) is positive semi-definite. The entries of B may be expressed in 
terms of a function Kb '■ ^ d — > 5i, with 

Bij = K B (si - Sj) = pl v K{\\si - Sj\\;p2,v) - p\ v K(\\si - Sj\\;pi_,v), 

and B is positive semi-definite if Kb is a positive definite function. Define 

(2ir) a JsRd 

pf [ e- iu > Tx K(x;p 2 ,v)dx- pj u [ e- iujT ' x K(x; Pl , v) dx 



(2vr) d J^d 

(4) = CSF 

Both integral terms in (4) are finite, with 

f e-^ T *K(x; p, v)dx = ^± d /% - 2v (p- 2 + \\u\\ 2 ^ v+d ^ 



(2n) d Md v 1 ir d / 2 T(is) 

the spectral density of the Matern correlation function. Therefore, 

am - ^r^W + M'r<>^> - w 2 + m' )-<*<» 

To show i^e is positive definite it suffices to show is positive for all ui. 

This is clear because < p\ < p2- Therefore c n (p2) < c n (pi) for any vector 
Z n . □ 

We can now make use of Theorem 2 in proving a more general result for 
the maximum likelihood estimator. 

Theorem 3. Let S n be an increasing sequence of subsets of D. Suppose 
(o"0'Po) T G (0,oo) x [p L ,Pu], for any < p L < pu < 00. Let {d- 2 n ,p n ) T 
maximize (2) over (0, 00) x [pl,Pu]- Then 

1. a 2 / / p 2 l u — > Oq/ 1 Pq v almost surely, and 

2. ^{al/pf - al/pl v ) -> N{Q,2{al/pf) 2 ) in distribution 

under 67(0, a 2 K PQtU ). 

Proof. By assumption, pl < Pn < Pu for every n. Define two sequences, 
Cn(p L ) and Cn(pv), according to (3). By Lemma 1, c n (p L ) < c n (p n ) = 
°~nl Pri — Cn(pu) fo r all n with probability one. Combining this with Theo- 
rem 2 applied to c n {pL) and c n (pu) implies the result. □ 
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Theorem 3 is useful because it justifies the procedure that is most often 
adopted in practice, of allowing the range parameter to be estimated from 
data. Based on arguments following from Zhang (2004), a few authors have 
advocated for fixing the range parameter in practice. However, as we shall 
show in Section 3, the estimator c n {p\) can often display sizeable bias, mak- 
ing the approximation in Theorem 2 quite inaccurate. Confidence intervals 
constructed using Theorem 2 can, due to this bias, have empirical coverage 
probabilities very near to zero in some cases. In contrast, we will show that 
confidence intervals for the maximum likelihood estimator, constructed us- 
ing Theorem 3, have close to nominal coverage even for moderate sample 
sizes. 

2.2. Prediction at New Locations. We now consider the problem of pre- 
dicting the value of the process at a new location so n °t in the set of observa- 
tion locations S n . Stein (1988, 1990, 1993, 1999) has considered this problem 
when an incorrect model is used. Predictors under the wrong model can be 
consistent under relatively weak conditions. Our focus is therefore on two 
other desirable properties, asymptotic efficiency and asymptotically correct 
estimation of prediction variance. In a seminal paper, Stein (1988) showed 
that both of these properties hold when the model used is equivalent to the 
true measure. In fact, these properties hold uniformly over predictands that 
are linear functionals of Z or L-2 limits of such functionals, such as inte- 
grals of Z (Stein, 1990). In the case of the Matern covariance, Theorem 1 
indicates that this holds for a model with the correct v and microergodic 
parameter a 2 / ' p 2u ' . This has led to statements in the literature to the effect 
that "the parameter c = a 2 j p 2v can be consistently estimated, and this is 
what matters for prediction." While this statement contains an element of 
truth, we will argue in this section that it can also be somewhat misleading, 
both in an asymptotic sense, as well as in guiding choices for applications. 

Under the mean zero Gaussian process model with Matern covariance 
function and known v > 0, define 

(5) Z n (p) = 7„(p) T r n (p)" 1 Z n , 

where -f n (p) = {K(s - Si;p,u)}i and T n (p) = {K(si - sf,p,v)} for i,j = 
1, ... ,n. Z n (p) is the best linear unbiased predictor for Zq = Z(so) under a 
presumed model G(0,a 2 K P;U ) for any value of a 2 . The predictor itself does 
not depend on a 2 , only p and v. Therefore, any intuition that one can fix 
p = pi, and that as n increases, c will be better estimated by c n (pi), and 
for this reason plug-in predictions will improve, is clearly a misunderstand- 
ing of asymptotic results. Equivalence, although sufficient for asymptotic 
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efficiency, is not necessary. The way in which c is relevant for prediction 
concerns estimates of the mean squared error of the predictor. Under model 
G(0, UqK POjU ), this is 

(6) Var CToVo (Z n (p) - Z ) = a 2 [l - 2 ln (p) T T n (p)- 1 ln (p )+ 

7n(p) T r n (p) _1 r n (p )r n (p) _1 7n(p)] 

where j n (po) an d ^n(po) are defined similarly to their counterparts using p. 
In the case that p = po, this expression simplifies to 

(7) Var^Z^o) - Z ) = a 2 [l - ^(fiofTM'^ipo)]. 

In practice, it is common to estimate the model parameters and then plug 
them into (5) and (7), treating them as known. The asymptotic properties of 
this procedure, so-called "plug-in prediction," are quite difficult to obtain. 
Instead, most theoretical development has been under a framework in which 
model parameters are fixed and do not vary with n. We will review these 
results and indicate how they may be extended to include estimation of the 
variance parameter a 2 with a fixed value of p, making precise the sense in 
which the statement regarding c at the beginning of this section should be 
interpreted. Our simulation results indicate that prediction mean squared 
error and coverage of prediction intervals can be improved by estimating p 
as well. However, extending the asymptotic results to this case has thus far 
been intractable. We return to a discussion of this problem at the end of 
this section. 

The following result is an application of Theorems 1 and 2 of Stein (1993). 

Theorem 4. Suppose G(0, <JqK po V ) and G(0, a 2 K piV ) are two Gaussian 
process measures on D with the same value of u > 0. 

1. As n — > oo, 

Var a2, P0 (Znifil) - Z o) 
YaT af vPo \Zn{P0) - Z ) 

2. Furthermore, if a^/p 2 ^ = a 2 /p 2v , then as n—t oo, 

Va M.Pi {Zn{pl) ~ Z ) 

(P) j— r- -)■ 1. 

Var a ( >o [Z n ( Pl )-Zo) 

Proof. Let fo be the spectral density corresponding to OqK PQ)V and f\ be 
the spectral density corresponding to o\K pX)V . The result follows from noting 
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that the function /o( w )IM| 2 ^ +d is bounded away from zero and infinity as 
\\u>\\ — > oo and that 

lim m = dt#. 

IM|->°° f (u)) <?l/pl U 

These two conditions satisfy those needed for Theorems 1 and 2 of Stein 
(1993). □ 

The implication of Theorem 4 is that if the correct value of v is used, any 
value of p will give asymptotic efficiency. The condition o^jp 2 ^ = a \l P\ 
is not necessary for asymptotic efficiency, but it does provide asympotically 
correct estimates of risk. The numerator in (8) is the naive mean squared er- 
ror for Z n (a 2 ,pi), assuming model G(0, a 2 K pitV ), whereas the denominator 
is the true mean squared error for Z n (a 2 ,pi), under model G(0,<JqK POi i,). 
We now show the same convergence happens if p is fixed at p\ but a 2 is es- 
timated via maximum likelihood. This is an extension of part 2 of Theorem 
4. Part 1 needs no extension, since the form of the predictor itself does not 
depend on a 2 . 

Theorem 5. Suppose G(Q,o~qK P0>1/ ) is a Gaussian process measure on 
D. Fix p\ > 0. For a sequence of observations Z n on an increasing sequence 
of subsets S n of D, define a 2 = Z^T n (pi)~ 1 Z n /n. Then as n — > oo, 



Var ^,Pl (Zn(pl) ~ Z { 



almost surely under G(Q,o~qK POjV ). 

Proof. Define a 2 = al(p\ / ' Pq) 2u '. Then write 
Var^2 pi (z n ( Pl ) - Z ) Var^2 iPl (z„(pi) - Z ) Var* pi [Z n (pi) ~ 



Var ^,po (Zn(pi) ~ Zo) Var CT 2 pi (Z n (pi) - Z ) Var^ ^ [Z n {p x ) - Z J 

By Theorem 4, Var CT 2 pi {Z n (pi) - Z )/ Var CT 2 po (Z n (>i) - Z ) ->■ 1. So we 
need only show that Vaic^2 pi (Z n (pi) — Zq) / Vaic (T 2 >pi (Z n (pi) — Zq) — > 1 almost 
surely under G(0, a^K P(hU ). By (7), Var^2 )Pl (Z n (pi)-Z )/ Var^ (Z n (pi) - 
Zq) = a 2 /af. Under G(0, a 2 K pi)V ), a 2 is equal in distribution to cr 2 /n times 
a x 2 random variable with n degrees of freedom and hence converges almost 
surely to a 2 as n — > oo. Because o-q/p 2 ^ = a 2 /p 2u , Theorem 1 gives that 
G(0, aQK P0)V ) and G(0, a\K piyV ) are equivalent, so that a 2 — > a 2 almost 
surely under G(0, a^K^^) as well. □ 
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We conjecture that the asymptotic behavior in Theorem 4.1 and Theo- 
rem 5 still holds if p\ is replaced by p n , the maximum likelihood estimator, 
although proving such a result has thus far been intractable. The primary 
difficulty is that proofs for fixed p are able to exploit equivalence, while it 
is generally not the case that G(0,d^Kp nil/ ) and G(0,<JqK POiV ) are equiv- 
alent for any finite n. Putter and Young (2001) provide conditions under 
which both asymptotic efficiency and asymptotically correct risk estimation 
hold for plug-in predictions. They require that the sequence of measures be 
contiguous. However, they are only able to demonstrate contiguity under 
extremely limited conditions, including the Matern covariance with v = 1/2 
when the observations are equally spaced in one dimension. They note that 
verifying contiguity "in more general (parametric) covariance function mod- 
els however, such as the Matern model, as proposed by Stein (1999), may 
pose formidable problems." 

One approach we have considered is to prove that expression (6) for the 
actual mean squared error of Z{p) is quasi-convex in p. This would allow a 
similar method of proof as in Theorem 3. Although we have not been able 
to find a counter-example to this statement, we have also not been able to 
prove that it is true. Viewed as a function of the weight vector T n (p)~ ln / n (p), 
(6) is clearly convex. However, the weight vector does not satisfy any simple 
conditions that could be used to show quasi-convexity when it is viewed as 
a function of p. 

Despite these difficulties, we believe it is important to continue studying 
the full plug-in prediction problem. Simulation results in the next section 
indicate that a procedure that uses the maximum likelihood estimators can 
have important advantages over a procedure that fixes the range parameter. 

3. Simulation Study. Fixing the range parameter is supported by 
asymptotic results, and it is computationally efficient in practice, as expen- 
sive computations involving the correlation matrix only need to be carried 
out once. However, it is unclear to what degree asymptotic results are appro- 
priate in guiding our choices for applied problems with finite sample sizes. 
To systematically explore this issue, we simulate data under a Gaussian pro- 
cess model for a variety of settings chosen to mimic the range of behavior we 
might observe in practice, and we compare the performance of procedures 
that either fix or estimate the range parameter. In some cases the perfor- 
mance metrics can be calculated analytically rather than via simulation, as 
we will indicate. 

We simulate data under the mean zero Gaussian process model with 
Matern covariance and smoothness parameter v = 0.5 or 1.5 and marginal 



12 



C. G. KAUFMAN AND B. A. SHABY 



Effective range = 0.1 Effective range = 0.3 Effective range = 1 

p = 0.0334 p = 0.1 p = 0.334 




Fig 1. Simulated random fields on [0, l] 2 under parameter settings used in the simulation 
study. Fields were simulated using the Cholesky method, using the same set of random 
deviates for each panel. The value of the range parameter p corresponding to each v and 
effective range combination is also indicated. 



variance a 2 = 1. We also use three effective ranges for the process. That is, 
we choose values of p such that the correlation decays to 0.05 at these pre- 
scribed distances. The distances chosen as effective ranges are 0.1, 0.3, and 
1. (We also carried out an expanded version of this simulation study incor- 
porating the settings v = 1.5 and distances 0.05,0.2,0.6, and 2. The pattern 
of results was consistent with what we report here.) Figure 1 illustrates the 
effect of these parameter settings when simulating from a Gaussian process 
over the unit square. The datasets in Figure 1 are each generated using the 
Cholesky decomposition method (Cressie, 1993, Section 3.6.1), transforming 
the same set of independent standard normal random samples for each pa- 
rameter setting. As we shall see, whether a particular sample size is "large 
enough" to be approximated by asymptotic results depends on both the 
degree of smoothness and the effective range of the process. 

In addition to the effective range and smoothness, we also vary the sam- 
ple size in the simulation. In computing our performance metrics, we use 
datasets of size n = 100, 400, 900, and 1600. To avoid numerical issues that 
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can arise from sampling locations situated too close to each other, sampling 
locations are constructed using a perturbed grid. We construct a regular 
grid with coordinates from 0.005 to 0.995 in increments of 0.015 in each 
dimension. To each gridpoint, we add a random perturbation according to 
a uniform distribution over [-0.005, 0.005] 2 . The resulting set of 4489 loca- 
tions therefore has the property that all pairs of points have at least 0.005 
distance from each other. We then choose random subsets of these locations 
to be our n = 100, 400, 900, or 1600 observation locations, with each sample 
size containing the points from smaller sample sizes. In evaluating the pre- 
dictive properties of models fit using a fixed or estimated range parameter, 
we consider a 50 x 50 regular grid of locations over [0, l] 2 . 

For each parameter setting, we simulate 1000 datasets corresponding to 
realizations of the Gaussian process observed at the union of n = 1600 ob- 
servation and m = 2500 prediction locations. For each dataset and sample 
size, we estimate a 2 and p by numerically maximizing the likelihood. We also 
calculate <5" 2 (pi) = Z^T n (pi)~ l Z n /n for values of p\ equal to 0.2,0.5,1,2, 
and 5 times the true value of p. Corresponding to each of these parameter 
estimates, we also construct 95% confidence intervals for c = a 2 j p 2v using 
the normal approximation provided by Theorem 2 when p is fixed and The- 
orem 3 when p is estimated. Finally, we construct kriging predictors and 
estimated standard errors for each of the m = 2500 prediction locations by 
plugging in parameter estimates into (5) and (7). 

The next sections discuss the results for estimation and prediction. Many 
of the results show a similar pattern, which can be summarized as follows. 
The performance of the maximum likelihood estimator, maximizing over 
both a 2 and p, is generally very good, especially by n = 1600. Procedures 
using a fixed p are almost always worse, although there are certain settings 
under which the differences are minimal. These tend to be for v = 1/2 
(corresponding to processes that are not mean-square differentiable) and a 
large effective range. In these cases, particularly when p is fixed at something 
larger than its true value, the estimators and predictors can still perform 
well. This agrees with some examples in the literature, for which v = 1/2 and 
large effective ranges were used (Zhang and Wang, 2010; Wang and Loh, 
2011). When the process is smooth (y = 1.5) and/or the true range of 
spatial correlation is small, estimation and prediction is markedly improved 
by estimating p via maximum likelihood. 

3.1. Parameter estimation. Given the asymptotic results in Zhang (2004) 
and Wang and Loh (2011) for c n {p\) for fixed pi, it is tempting to adopt 
the intuition that this estimator can "adapt" to incorrectly specified values 
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Fig 2. Sampling distributions for c n when v = 1.5 and the effective range is 0.3. The 
range parameter is either fixed at the true value (Truth), estimated via maximum likelihood 
(MLE), or fixed at a multiple of the truth (0.2 Truth, . . . , 5 Truth). The four boxplots in 
each grouping correspond to sample sizes of n = 100, 400, 900, and 1600, reading from 
left to right. 



of p. While this is true asymptotically, we observe in our simulation results 
that this adaptation in many cases requires a very large value of n; sam- 
pling distributions can be highly biased and can move very slowly toward 
the truth as n increases. Figure 2 illustrates these effects for a subset of our 
simulation results, namely when v = 1.5 and the effective range is 0.3 (This 
corresponds to the center-bottom panel of Figure 1). Sampling distributions 
for c n (pi) are noticeably biased. As we expect based on Theorem 2, these 
biases decrease with n, although even when n = 1600 the true value of c 
lies far in the tail of the sampling distribution. In contrast, the sampling 
distributions for the maximum likelihood estimator c n (p n ) have negligible 
bias. Indeed, they behave very similarly to those for the estimator of c that 
fixes p at the truth. 

Similar effects can be seen for other values of v and effective range. Table 
1 shows the relative bias of different estimators of c. Within a given value of 
is, the effects are most apparent when the effective range is small. The bias 
is larger when v is 1.5 rather than 0.5. Fixing p at a value smaller than po 
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seems to have a more deleterious effect on bias than fixing p larger than pq. 
In all cases, including when the sample size n is large, estimating p results 
in notably better performance relative to fixing p. 

Table 1 

Bias in c„(p n ) relative to the true value of c, for the maximum likelihood estimator 
(MLE) and when p n is fixed at a multiple of the true value of p. Values labelled as 0.00 
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109.91 
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6.73 




1600 


3.41 
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0.20 
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29.70 


3.91 
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100 


0.94 


0.63 


0.18 
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1.15 
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5.49 


2.02 


0.40 




900 
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0.20 




1600 


0.51 
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0.02 


3.29 


0.71 
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2p 


100 
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-0.05 
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-0.16 




400 


-0.24 


-0.08 
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-0.24 


-0.06 




900 


-0.16 
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-0.04 


-0.01 
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If Theorem 2 is used to construct confidence intervals and n is in fact not 
large enough for the normal approximation to be appropriate, the coverage 
of such intervals can be disastrously low. This is clearly the case for fixed p\ 
in Figure 2, for which the sampling distributions are extremely biased even 
when n = 1600. Empirical coverage rates for confidence intervals constructed 
using c n (pi) and c n (p n ) are shown in Table 2. These were constructed as 
Cn(/)n)± 1-96 y / 2c n (p n ) 2 /n for p n equal to the maximum likelihood estimator 
or a fixed p\. Combining parts 1 and 2 of Theorems 2 and 3 implies that 
these intervals are asymptotically valid 95% confidence intervals for c. Not 
surprisingly, however, given the large biases observed when p is fixed, the 
differences in the empirical coverage rates between fixed and estimated p 
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are striking, even when n is large. In many cases the coverage for intervals 
constructed using c n (pi) was 0%, to within Monte Carlo sampling error. The 
results across different values of v and effective range mimic the pattern for 
the bias, with coverage being best when v is small and effective range is 
large. For fixed p±, it also appears better to choose p\ > po than p\ < pq. 

Another conclusion we draw from this table is that the sample size needed 
for the asymptotic approximation in our Theorem 3 to be appropriate is 
quite a bit smaller than is needed for the equivalent result from Wang and Loh 
(2011) when p is fixed. This does not contradict the simulation results in 
Wang and Loh (2011). These authors present results indicating that quan- 
tiles of the sampling distribution for c n (pi) are well matched by asymptotic 
approximations, but their data were simulated only for small values of v (0.25 
and 0.5) and large effective ranges (2.8 and 3.7). Intuitively, such a pattern 
of results is not surprising, because as discussed in Zhang and Zimmerman 
(2005), larger effective ranges can be thought of as corresponding better to 
the infill asympotic framework. However, given that the performance of the 
maximum likelihood c n (p n ) is superior across a range of parameter settings, 
often markedly so, the additional computational expense of maximizing the 
profile likelihood for p seems a worthwhile investment. 

3.2. Prediction. The mean squared error of predictor Z n {p\) may be cal- 
culated in closed form using (6), rather than relying on simulated data. When 
the plug-in predictor Z n (p n ) is used, we need to integrate over the sampling 
distribution for p n , which we can do using the simulation results from Sec- 
tion 3.1. For both fixed and estimated p, we calculate the average mean 
squared prediction error, averaging over the m = 2500 prediction points in 
the domain. Because the prediction problem varies in difficulty according to 
v, effective range, and sample size n, we report the percent increase in mean 
squared prediction error relative to the optimal mean squared prediction 
error using the true value of p, which is calculated from (7). These results 
are shown in Table 3. 

Looking at Table 3, it is clear that plug-in prediction using the maximum 
likelihood estimator p n performs quite well relative to predicting with the 
true value of p. For n = 900 and 1600, the increase in mean squared error 
is less than 0.1 percent in all cases. It is also clear that there are cases in 
which it makes little difference if p is fixed at an incorrect value. This is 
true when the effective range is large (er = 1) and p\ is fixed at something 
larger than the true value. However, there are also cases in which fixing p 
can lead to quite a large loss of efficiency. These effects are magnified when 
we move from v = 0.5 to v = 1.5, suggesting that a misspecified value of 
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Table 2 

Empirical coverage rates of confidence intervals for c = a 2 / p 2u , expressed as percentages. 
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p is more problematic for smoother processes. This aligns with some earlier 
cases in the literature in which predictions with a fixed p were still quite 
accurate. For example, Zhang and Wang (2010) examined precipitation data 
using a predictive process model (Banerjee et al., 2008) and concluded that a 
variety of prediction metrics did not change when p was fixed at values larger 
than the maximum likelihood estimator. However, the underlying covariance 
model for the predictive process was Matern with v = 1/2, corresponding to 
a process that is not mean square differentiable. The model also incorporated 
a mean term and measurement error, which is beyond the scope of what we 
consider here. 

Table 4 shows the empirical coverage rates of prediction intervals con- 
structed using the naive variance estimator in (7), replacing po with either 
pi or p n . In calculating these rates, we average over both simulation replica- 
tions and over prediction locations. In a similar pattern to what we observe 
for mean squared error in Table 3, using the maximum likelihood estimator 
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Table 3 

Percent increase in mean squared prediction error relative to the optimal mean squared 
prediction error using the true value of p. Values labelled as 0.0 should be interpreted to 

mean less than 0.1 percent. 
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produces intervals with the nominal rate in nearly all cases, and the esti- 
mators fixing p at something larger than the true value achieve this rate for 
n = 900 and 1600 when the effective range is large (er = 1). However, the 
intervals tend to be too conservative when the effective range is large and p\ 
is too small, and they tend to be not conservative enough when the effective 
range is small and p\ is too big. Another striking finding is that coverage 
can sometimes become worse initially as n increases. See for example the 
entries in Table 4 for which v = 1.5, p n = 0.2p, and the effective range is 
0.3. Although we know from Theorem 4 that this coverage will eventually be 
close to nominal, the sample size needed for this to occur may be quite large, 
and coverage can get worse with n before it improves, making it difficult to 
judge when fixing p is "safe" in practice. 
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Table 4 

Empirical coverage rates of prediction intervals, expressed as percentages. 
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4. Covariance tapering. We return now to asympotic theory in a 
slightly different but related setting, showing that our approach in Section 
2.1 is not limited to the maximum likelihood estimator. The method of 
covariance tapering has been suggested as a means of coping with computa- 
tional difficulties posed by large spatial datasets (Purrer, Genton and Nychka, 
2006; Kaufman, Schervish and Nychka, 2008). The likelihood function (2) 
may be replaced by 
(10) 

£n,tap(° 2 >P) = (2v<J 2 )- n/2 \T n (p) o TX 1 ' 2 exp j- J-^J[r n (p) o TJ-^J 

where T n is a correlation matrix formed using a correlation or "taper" func- 
tion that has compact support and Ao B indicates the direct or Hadamard 
product between matrices A and B of the same dimension. Because taper- 
ing can introduce sparsity, maximizing (10) has computational advantages 
over maximizing the original likelihood function. For a fixed value of p, (10) 
is maximized by & 2 tap (p) = Z^[T n (p) o T n ]~ 1 Z n /n, and the corresponding 
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estimator of c is 

(11) c nMp (p) = al tap (p)/ P 2v = Z' n [T n (p) o T n ]- l Z n /{np 2v ). 

Under certain conditions on the taper function, c n) t ap (p) is strongly con- 
sistent and has the same asymptotic normal distribution as in Theorem 2 
(Kaufman, Schervish and Nychka, 2008; Du, Zhang and Mandrekar, 2009; 
Wang and Loh, 2011). As was the case for the original likelihood, all of 
these results considered a fixed value of p. However, the same argument 
used in the proof of Lemma 1 can be used to show that c n j ap (p) is a mono- 
tone function of p, allowing us to extend the asymptotic results to the case 
that (10) is maximized over both a 2 and p. 

Lemma 2. Let S n = {s\, S2, ■ ■ ■ , s n G -D C K d } denote any set of ob- 
servation locations in any dimension. Fix v > and define T n (p) to be 
the matrix with entries K{si — Sj]p,v) defined using the Matern correla- 
tion function and T n to be any positive semi-definite matrix. Then for any 
< pi < p 2 , c nMp {pi) < Cn,ta P (pi) for any vector Z n . 

Proof. The proof is similar to that for Lemma 1. Replace the matrix 
B by [p2 U T n (p2) — p 2u T n (pi)} o T n and use the fact that the direct prod- 
uct of two positive semi-definite matrices is again positive semi-definite 
(Horn and Johnson, 1991, Theorem 5.2.1). □ 

This can be used to extend Theorems 1 and 2 of Wang and Loh (2011) to 
cover the case that p is estimated by maximizing the tapered likelihood. 
(Special cases of Theorems 1 and 2 of Wang and Loh (2011) appear as 
Theorem 2 of Kaufman, Schervish and Nychka (2008) and Theorem 5.ii of 
Du, Zhang and Mandrekar (2009).) 

Theorem 6. Let S n be an increasing sequence of subsets of D. Suppose 
(o"o'Po) T G (0,oo) x [p L ,Pu], for any < p L < pu < oo. Let K tap be an 
isotropic correlation function with support [—1, l] d whose spectral density 

ftapH = j^y d J^e~^ Tx K tap (x)dx 

satisfies 

f ( — 

Jtap{W) ^ + ^ 2 y+d/2+e 

for some constants e > max{d/4, 1 — z/} and M > 0. Define a sequence of 
positive taper ranges 7 n = Cn~ a for < C < 1 and a > 0. For each n, let 
T n be the n x n matrix with entries K tap (\\si — Sj\\/~/ n ) and let <7^ tap and 
Pn,tap maximize (10) over over (0, oo) x [pL,pu]. 
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1. Suppose < a < l/(4f + 4e + 2d). Then as n — > oo 



tap/ A 1 n,tap 




2u 



almost surely under 67(0, crj 




2. Suppose < a < l/(8z; + 8e + 2j and 2a(2z^ + 2e + d)/ min{2, 4- d, 4e- 
d, 4f + d} < (1 — 2ad) / (2d) . Then as n — >■ oo, 



m distribution under G(0,o-qK PO)1 ,). 

Proof. By assumption, p^, < /5„ jtap < p/j for every n. Define two se- 
quences, c n ,tap{pL) and c nt tap{pu), according to (11). By Lemma 2, c nMp (pL) < 
Cn,tap(Pn,tap) = °nl P%,tap ^ Cn,ta P {pu) for all n with probability one. Combin- 
ing this with Theorems 1 and 2 of Wang and Loh (2011) applied to c n ,tap{PL) 
and c n ,tap(pu) implies the result. □ 

Remark 1. The case a = corresponds to a constant taper range C, 
for which the conditions on a are trivially satisfied. 

Simulation results in Kaufman (2006) indicate that the estimator <j\ tap j p^ tc 
can be noticeably biased in finite samples. This is in contrast to the re- 
sults for the maximum likelihood estimator shown in Section 3. Our un- 
derstanding of this result is that (10) essentially imposes a shorter effec- 
tive range through the taper function. Therefore, the same caution we have 
urged in interpreting results that fix the range parameter should be ap- 
plied to Theorem 6 as well. Also, the simulation results in Kaufman (2006) 
were for a fixed taper range (a = 0). The bias is likely to decrease even 
more slowly with n if the taper range is allowed to decrease as n increases. 
Kaufman, Schervish and Nychka (2008) introduced another approximation 
using tapering that does not suffer from this bias, although existing asymp- 
totic results for it do not cover the infill case (Shaby and Ruppert, 2011). 

5. Discussion. We have argued for the importance of estimating the 
range parameter in geostatistical models, even when asymptotic results can 
be obtained while holding it fixed. We have shown for the widely-used 
Matern model that the same asymptotic behavior can be obtained for an 
estimator of the consistently estimable parameter c = a 2 / p 2u when the vari- 
ance parameter a 2 and range parameter p are jointly estimated via maxi- 
mum likelihood, rather than using a fixed value of p. As demonstrated in our 




a 2 /p 2 n^N(0,2(a 2 /p 2 n 2 ) 
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simulation study, this can lead to dramatic reductions in bias and improve- 
ments in empirical coverage rates of confidence intervals for this parameter. 
We also clarified the conditions for asymptotic efficiency of predictions un- 
der this model, highlighting a common misunderstanding of existing results 
that fix the range parameter. We extended asymptotic results for prediction 
to allow the variance parameter to be estimated, although results for plug- in 
prediction using the maximum likelihood estimator for both a 2 and p are 
still elusive. However, simulation results support the importance of estimat- 
ing the range parameter, both in estimating c and in carrying out plug-in 
prediction. This is particularly evident when the process is smooth (larger 
u) or the effective range of the process is small. 

We have made a number of simplifying assumptions. Considering the ways 
in which these assumptions may be relaxed provides a rich set of questions for 
future research. For example, our results concern only mean zero Gaussian 
processes, which is equivalent to assuming that the mean of the process 
is known. Results on equivalence of mean zero Gaussian measures such as 
Theorem 1 (Zhang, 2004) can be used in proving equivalence of Gaussian 
process measures with different means (Stein, 1999, Chapter 4, Corollary 
5). Demonstrating this basically requires one to show that the difference 
between mean terms be sufficiently smooth (Yadrenko, 1983, page 138). 
However, the primary difficulty is in extending estimation results. Zhang 
(2004) indicates that his method of proof is not easily extended to the case 
of an unknown mean term. Asymptotic results for the case v =1/2 and d = 1 
are given in Theorem 3 of Ying (1991), and it seems plausible that similar 
results might hold for d = 2 and 3. A more direct route than the method of 
proof used in (Zhang, 2004) might be to characterize the distribution of the 
quadratic form in the expression for the maximum likelihood estimator <7 2 . 

We have also not considered what happens when the observations are not 
of the process Z itself, but of Z observed with measurement error. Again, re- 
sults for equivalence and prediction can be extended in a relatively straight- 
forward way. We expect something like Theorem 3 should hold for the case 
that Z is observed with measurement error. However, in a restricted version 
of this problem, the rate of convergence is known to change when measure- 
ment error is also included. Specifically, Chen, Simpson and Ying (2000) 
extended the results of Ying (1991) for the case v = 1/2 and d = 1 (the 
Ornstein-Uhlenbeck) process to include the presence of a measurement error 
term. The introduction of the error term reduces the rate of convergence of 
the maximum likelihood estimator for c from the usual order n -1 / 2 to order 

Perhaps the most important restriction, both here and in previous work, 
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is that the smoothness parameter v is assumed to be known. Estimating 
v provides desirable flexibility, as this parameter controls the mean square 
differentiability of the process. However, we know of no results concerning 
the maximum likelihood estimator in this case. Stein (1999, Section 6.7) 
examines a periodic version of the Matern model and argues that a\ and 
v n should have a joint asymptotic normal distribution, but it is an open 
question whether a similar result holds for non-periodic fields. Proving such 
a result will likely require a new approach, as the method of proof for showing 
equivalence used by Zhang (2004) cannot be extended to the case o^/ Pq U ° = 
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